Project: Investigate TMDb Movies Dataset

Introduction

This data set contains information about 10,000 movies collected from The Movie Database (TMDb), including user ratings and revenue.

Research Questions

1-Which genres are most popular from year to year?
2-Which year had the most movie releases?
3-Which movies are the most popular of all time?
4-Which movie title had the highest budget?
5-Does a bigger film production budget result in more popularity?

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline 

Data Wrangling

Tip: In this section of the report, I will load in the data, check for cleanliness, and then trim and clean the dataset for analysis.

General Properties

In [3]:
# Load the data and preview the first 10 rows to inspect data types
#   and look for missing or possibly errant values.
df = pd.read_csv('tmdb-movies.csv')
df.head(10)
Out[3]:
id imdb_id popularity budget revenue original_title cast homepage director tagline ... overview runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj
0 135397 tt0369610 32.985763 150000000 1513528810 Jurassic World Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... http://www.jurassicworld.com/ Colin Trevorrow The park is open. ... Twenty-two years after the events of Jurassic ... 124 Action|Adventure|Science Fiction|Thriller Universal Studios|Amblin Entertainment|Legenda... 06/09/2015 5562 6.5 2015 1.379999e+08 1.392446e+09
1 76341 tt1392190 28.419936 150000000 378436354 Mad Max: Fury Road Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... http://www.madmaxmovie.com/ George Miller What a Lovely Day. ... An apocalyptic story set in the furthest reach... 120 Action|Adventure|Science Fiction|Thriller Village Roadshow Pictures|Kennedy Miller Produ... 5/13/15 6185 7.1 2015 1.379999e+08 3.481613e+08
2 262500 tt2908446 13.112507 110000000 295238201 Insurgent Shailene Woodley|Theo James|Kate Winslet|Ansel... http://www.thedivergentseries.movie/#insurgent Robert Schwentke One Choice Can Destroy You ... Beatrice Prior must confront her inner demons ... 119 Adventure|Science Fiction|Thriller Summit Entertainment|Mandeville Films|Red Wago... 3/18/15 2480 6.3 2015 1.012000e+08 2.716190e+08
3 140607 tt2488496 11.173104 200000000 2068178225 Star Wars: The Force Awakens Harrison Ford|Mark Hamill|Carrie Fisher|Adam D... http://www.starwars.com/films/star-wars-episod... J.J. Abrams Every generation has a story. ... Thirty years after defeating the Galactic Empi... 136 Action|Adventure|Science Fiction|Fantasy Lucasfilm|Truenorth Productions|Bad Robot 12/15/15 5292 7.5 2015 1.839999e+08 1.902723e+09
4 168259 tt2820852 9.335014 190000000 1506249360 Furious 7 Vin Diesel|Paul Walker|Jason Statham|Michelle ... http://www.furious7.com/ James Wan Vengeance Hits Home ... Deckard Shaw seeks revenge against Dominic Tor... 137 Action|Crime|Thriller Universal Pictures|Original Film|Media Rights ... 04/01/2015 2947 7.3 2015 1.747999e+08 1.385749e+09
5 281957 tt1663202 9.110700 135000000 532950503 The Revenant Leonardo DiCaprio|Tom Hardy|Will Poulter|Domhn... http://www.foxmovies.com/movies/the-revenant Alejandro González Iñárritu (n. One who has returned, as if from the dead.) ... In the 1820s, a frontiersman, Hugh Glass, sets... 156 Western|Drama|Adventure|Thriller Regency Enterprises|Appian Way|CatchPlay|Anony... 12/25/15 3929 7.2 2015 1.241999e+08 4.903142e+08
6 87101 tt1340138 8.654359 155000000 440603537 Terminator Genisys Arnold Schwarzenegger|Jason Clarke|Emilia Clar... http://www.terminatormovie.com/ Alan Taylor Reset the future ... The year is 2029. John Connor, leader of the r... 125 Science Fiction|Action|Thriller|Adventure Paramount Pictures|Skydance Productions 6/23/15 2598 5.8 2015 1.425999e+08 4.053551e+08
7 286217 tt3659388 7.667400 108000000 595380321 The Martian Matt Damon|Jessica Chastain|Kristen Wiig|Jeff ... http://www.foxmovies.com/movies/the-martian Ridley Scott Bring Him Home ... During a manned mission to Mars, Astronaut Mar... 141 Drama|Adventure|Science Fiction Twentieth Century Fox Film Corporation|Scott F... 9/30/15 4572 7.6 2015 9.935996e+07 5.477497e+08
8 211672 tt2293640 7.404165 74000000 1156730962 Minions Sandra Bullock|Jon Hamm|Michael Keaton|Allison... http://www.minionsmovie.com/ Kyle Balda|Pierre Coffin Before Gru, they had a history of bad bosses ... Minions Stuart, Kevin and Bob are recruited by... 91 Family|Animation|Adventure|Comedy Universal Pictures|Illumination Entertainment 6/17/15 2893 6.5 2015 6.807997e+07 1.064192e+09
9 150540 tt2096673 6.326804 175000000 853708609 Inside Out Amy Poehler|Phyllis Smith|Richard Kind|Bill Ha... http://movies.disney.com/inside-out Pete Docter Meet the little voices inside your head. ... Growing up can be a bumpy road, and it's no ex... 94 Comedy|Animation|Family Walt Disney Pictures|Pixar Animation Studios|W... 06/09/2015 3935 8.0 2015 1.609999e+08 7.854116e+08

10 rows × 21 columns

In [4]:
df.shape
Out[4]:
(10866, 21)
In [5]:
df.describe()
Out[5]:
                  id     popularity         budget        revenue        runtime     vote_count   vote_average   release_year     budget_adj    revenue_adj
count   10866.000000   10866.000000   1.086600e+04   1.086600e+04   10866.000000   10866.000000   10866.000000   10866.000000   1.086600e+04   1.086600e+04
mean    66064.177434       0.646441   1.462570e+07   3.982332e+07     102.070863     217.389748       5.974922    2001.322658   1.755104e+07   5.136436e+07
std     92130.136561       1.000185   3.091321e+07   1.170035e+08      31.381405     575.619058       0.935142      12.812941   3.430616e+07   1.446325e+08
min         5.000000       0.000065   0.000000e+00   0.000000e+00       0.000000      10.000000       1.500000    1960.000000   0.000000e+00   0.000000e+00
25%     10596.250000       0.207583   0.000000e+00   0.000000e+00      90.000000      17.000000       5.400000    1995.000000   0.000000e+00   0.000000e+00
50%     20669.000000       0.383856   0.000000e+00   0.000000e+00      99.000000      38.000000       6.000000    2006.000000   0.000000e+00   0.000000e+00
75%     75610.000000       0.713817   1.500000e+07   2.400000e+07     111.000000     145.750000       6.600000    2011.000000   2.085325e+07   3.369710e+07
max    417859.000000      32.985763   4.250000e+08   2.781506e+09     900.000000    9767.000000       9.200000    2015.000000   4.250000e+08   2.827124e+09
In [6]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
id                      10866 non-null int64
imdb_id                 10856 non-null object
popularity              10866 non-null float64
budget                  10866 non-null int64
revenue                 10866 non-null int64
original_title          10866 non-null object
cast                    10790 non-null object
homepage                2936 non-null object
director                10822 non-null object
tagline                 8042 non-null object
keywords                9373 non-null object
overview                10862 non-null object
runtime                 10866 non-null int64
genres                  10843 non-null object
production_companies    9836 non-null object
release_date            10866 non-null object
vote_count              10866 non-null int64
vote_average            10866 non-null float64
release_year            10866 non-null int64
budget_adj              10866 non-null float64
revenue_adj             10866 non-null float64
dtypes: float64(4), int64(6), object(11)
memory usage: 1.7+ MB
In [7]:
# Histograms of the numeric columns before cleaning
df.hist(figsize=(15,10));

Data Cleaning

Tip: Clean the data by dropping duplicated rows and, where needed, rows with NaN values.

In [8]:
df.duplicated().sum()
Out[8]:
1
In [9]:
# Drop duplicated rows
df.drop_duplicates(inplace=True)
In [10]:
# Confirm that the duplicates were removed
df.duplicated().sum()
Out[10]:
0
In [11]:
df.shape
Out[11]:
(10865, 21)
In [13]:
df.isnull().sum()
Out[13]:
id                         0
imdb_id                   10
popularity                 0
budget                     0
revenue                    0
original_title             0
cast                      76
homepage                7929
director                  44
tagline                 2824
keywords                1493
overview                   4
runtime                    0
genres                    23
production_companies    1030
release_date               0
vote_count                 0
vote_average               0
release_year               0
budget_adj                 0
revenue_adj                0
dtype: int64
In [14]:
df[df.cast.isnull()]
Out[14]:
id imdb_id popularity budget revenue original_title cast homepage director tagline ... overview runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj
371 345637 tt4661600 0.422901 0 0 Sanjay's Super Team NaN NaN Sanjay Patel NaN ... Sanjay's Super Team follows the daydream of a ... 7 Animation Pixar Animation Studios 11/25/15 47 6.9 2015 0.000000 0.0
441 355020 tt4908644 0.220751 0 0 Winter on Fire: Ukraine's Fight for Freedom NaN http://www.netflix.com/title/80031666 Evgeny Afineevsky The Next Generation Of Revolution ... A documentary on the unrest in Ukraine during ... 98 Documentary Passion Pictures|Campbell Grobman Films|Afinee... 10/09/2015 37 8.2 2015 0.000000 0.0
465 321109 tt4393514 0.201696 0 0 Bitter Lake NaN NaN Adam Curtis NaN ... An experimental documentary that explores Saud... 135 Documentary BBC 1/24/15 19 7.8 2015 0.000000 0.0
536 333350 tt3762974 0.122543 0 0 A Faster Horse NaN NaN David Gelb NaN ... David Gelb (Jiro Dreams of Sushi) tackles anot... 90 Documentary NaN 10/08/2015 12 8.0 2015 0.000000 0.0
538 224972 tt3983674 0.114264 0 0 The Mask You Live In NaN http://themaskyoulivein.org Jennifer Siebel Newsom Is american masculinity harming our boys, men ... ... Compared to girls, research shows that boys in... 88 Documentary NaN 01/01/2015 11 8.9 2015 0.000000 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9677 13926 tt0093832 0.253376 0 0 Red's Dream NaN NaN John Lasseter NaN ... Life as the sole sale item in the clearance co... 4 Animation Pixar Animation Studios 8/17/87 44 6.6 1987 0.000000 0.0
9755 48714 tt0061402 0.046272 0 0 The Big Shave NaN NaN Martin Scorsese NaN ... This short film is a metaphor for the Vietnam ... 6 Drama NaN 01/01/1968 12 6.7 1968 0.000000 0.0
10434 48784 tt0060984 0.146906 200 0 Six Men Getting Sick NaN NaN David Lynch NaN ... Lynch's first film project consists of a loop ... 4 Animation Pensylvania Academy of Fine Arts 01/01/1967 16 5.2 1967 1307.352748 0.0
10550 13925 tt0091455 0.306425 0 0 Luxo Jr. NaN http://www.pixar.com/short_films/Theatrical-Sh... John Lasseter NaN ... A baby lamp finds a ball to play with and it's... 2 Animation|Family Pixar Animation Studios 8/17/86 81 7.3 1986 0.000000 0.0
10754 3171 tt0064064 0.002757 0 0 Bambi Meets Godzilla NaN NaN Marv Newland NaN ... Bambi is nibbling the grass, unaware of the up... 2 Animation|Comedy NaN 01/01/1969 12 5.6 1969 0.000000 0.0

76 rows × 21 columns

In [17]:
df['tagline'].value_counts()
Out[17]:
Based on a true story.                                                                                                         5
Be careful what you wish for.                                                                                                  3
Two Films. One Love.                                                                                                           3
Some things are better left buried.                                                                                            2
Inspect the unexpected.                                                                                                        2
                                                                                                                              ..
I warned you not to go out tonight.                                                                                            1
When U. S. Bates told his son he could have any present he wanted, he picked the most outrageous gift of all... Jack Brown.    1
Passion, Betrayal, Revenge, A hostile takeover is underway.                                                                    1
One of the most legendary directors of our time takes you on an extraordinary adventure.                                       1
May the Lord have mercy and grant you a swift death                                                                            1
Name: tagline, Length: 7997, dtype: int64
In [18]:
# Drop every row that has a missing value in any column
df.dropna(inplace=True)
In [19]:
df.isnull().sum()
Out[19]:
id                      0
imdb_id                 0
popularity              0
budget                  0
revenue                 0
original_title          0
cast                    0
homepage                0
director                0
tagline                 0
keywords                0
overview                0
runtime                 0
genres                  0
production_companies    0
release_date            0
vote_count              0
vote_average            0
release_year            0
budget_adj              0
revenue_adj             0
dtype: int64
In [21]:
df.shape
Out[21]:
(1992, 21)
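
Most of the drop from 10,865 to 1,992 rows comes from sparse columns such as homepage (7,929 missing) and tagline (2,824 missing), which the research questions do not use. As a minimal sketch of a gentler alternative, the example below drops rows only when a column needed for the analysis is missing. The small `movies` frame and the `needed` list are made up for illustration; they are not part of the original notebook.

```python
import pandas as pd

# Tiny made-up frame that mimics the missing-value pattern in the dataset.
movies = pd.DataFrame({
    "original_title": ["A", "B", "C", "D"],
    "genres": ["Drama", None, "Comedy", "Action|Thriller"],
    "homepage": [None, None, "http://example.com", None],
    "budget": [10, 20, 30, 40],
})

# Dropping on every column keeps only rows with no gaps at all.
strict = movies.dropna()

# Dropping only on the columns the analysis uses keeps more rows.
needed = ["genres"]  # assumption: the columns the research questions rely on
relaxed = movies.dropna(subset=needed)

print(len(strict), len(relaxed))
```

Whether the extra rows are worth keeping depends on the question being asked, so the choice of columns in `needed` is a judgment call.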
In [22]:
df.hist(figsize=(15,10));
In [23]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1992 entries, 0 to 10819
Data columns (total 21 columns):
id                      1992 non-null int64
imdb_id                 1992 non-null object
popularity              1992 non-null float64
budget                  1992 non-null int64
revenue                 1992 non-null int64
original_title          1992 non-null object
cast                    1992 non-null object
homepage                1992 non-null object
director                1992 non-null object
tagline                 1992 non-null object
keywords                1992 non-null object
overview                1992 non-null object
runtime                 1992 non-null int64
genres                  1992 non-null object
production_companies    1992 non-null object
release_date            1992 non-null object
vote_count              1992 non-null int64
vote_average            1992 non-null float64
release_year            1992 non-null int64
budget_adj              1992 non-null float64
revenue_adj             1992 non-null float64
dtypes: float64(4), int64(6), object(11)
memory usage: 342.4+ KB
In [24]:
df.describe()
Out[24]:
                  id     popularity         budget        revenue        runtime     vote_count   vote_average   release_year     budget_adj    revenue_adj
count    1992.000000    1992.000000   1.992000e+03   1.992000e+03    1992.000000    1992.000000    1992.000000    1992.000000   1.992000e+03   1.992000e+03
mean    71652.152108       1.316763   3.454924e+07   1.152153e+08     106.040161     643.616968       6.178614    2007.796687   3.627376e+07   1.302391e+08
std     92355.883915       1.873563   5.061878e+07   2.202887e+08      29.234592    1092.355998       0.881955       7.549224   5.129783e+07   2.564338e+08
min        11.000000       0.000620   0.000000e+00   0.000000e+00       0.000000      10.000000       2.100000    1961.000000   0.000000e+00   0.000000e+00
25%      9699.000000       0.384079   0.000000e+00   0.000000e+00      92.000000      51.000000       5.600000    2006.000000   0.000000e+00   0.000000e+00
50%     35112.500000       0.774223   1.500000e+07   2.578782e+07     102.000000     210.000000       6.200000    2010.000000   1.524601e+07   2.806370e+07
75%     83573.000000       1.538639   4.800000e+07   1.278787e+08     116.000000     688.250000       6.800000    2012.000000   5.064450e+07   1.393645e+08
max    414419.000000      32.985763   4.250000e+08   2.781506e+09     705.000000    9767.000000       8.300000    2015.000000   4.250000e+08   2.827124e+09

Exploratory Data Analysis

Tip: Now that the data is clean, we can explore it by computing statistics and creating visualizations.

Research Question 1 (Which genres are most popular from year to year?)

We will look at the genre labels and count the movies in each genre to find the most popular one.

In [25]:
genres = df['genres'].value_counts()
genres
Out[25]:
Drama                                           127
Comedy                                          105
Drama|Romance                                    52
Documentary                                      51
Horror|Thriller                                  50
                                               ... 
Drama|Mystery|Crime                               1
Action|Comedy|Adventure                           1
Fantasy|Drama|Comedy|Science Fiction|Romance      1
Animation|Drama|Science Fiction|Thriller          1
Animation|Drama                                   1
Name: genres, Length: 682, dtype: int64
In [45]:
genres.plot(kind='barh',figsize=(7,100))
plt.ylabel('genres')
plt.xlabel('number of movies')
plt.title('popularity of genres')
Out[45]:
Text(0.5, 1.0, 'popularity of genres')
In [58]:
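def separate_count(column):
    """Count how often each '|'-separated value appears in df[column]."""
    # Join all rows into one '|'-delimited string, then split it into single values
    split_data = pd.Series(df[column].str.cat(sep='|').split('|'))
    # value_counts sorts from most to least frequent
    count_data = split_data.value_counts(ascending=False)
    return count_data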
In [59]:
# Pie chart of each genre's share of the movies
separate_count("genres").plot(kind="pie",figsize=(9,9),autopct="%1.1f%%")
# the title of the plot
plt.title('Percentage Of Genres')
plt.ylabel('');

Conclusions

Tip: Drama is the most popular genre, measured by the number of movies.
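
The counts above show which genre appears most often, while the research question asks about popularity from year to year. One possible way to approach that, sketched here under assumptions, is to give each movie one row per genre, average the popularity score per genre and year, and pick the top genre in each year. The `movies` frame below is made-up sample data with the same column names as the dataset.

```python
import pandas as pd

# Made-up sample with the dataset's column names.
movies = pd.DataFrame({
    "release_year": [2014, 2014, 2015, 2015],
    "genres": ["Drama|Romance", "Action", "Action|Drama", "Drama"],
    "popularity": [1.0, 3.0, 2.0, 4.0],
})

# One row per (movie, genre) pair.
long = movies.assign(genre=movies["genres"].str.split("|")).explode("genre")

# Mean popularity of each genre within each year.
mean_pop = long.groupby(["release_year", "genre"], as_index=False)["popularity"].mean()

# For each year, keep the row with the highest mean popularity.
top_rows = mean_pop.loc[mean_pop.groupby("release_year")["popularity"].idxmax()]
top_genre = top_rows.set_index("release_year")["genre"]

print(top_genre)
```

Using the mean is only one choice; the median or the total popularity per genre would be equally defensible and may give different answers.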

Research Question 2 (Which year had the most movie releases?)

The year with the most releases is 2011, with 219 movies in the cleaned data.

In [27]:
release = df['release_year'].value_counts()
release
Out[27]:
2011    219
2010    206
2009    192
2015    165
2014    153
2012    145
2008    142
2007    135
2013    128
2006     92
2005     72
2004     48
2003     40
2002     31
1999     24
2000     20
2001     19
1996     15
1998     13
1997     11
1995     10
1993     10
1994      8
1987      8
1981      6
1990      6
1983      6
1979      5
1984      5
1989      5
1992      5
1985      4
1988      4
1978      4
1971      4
1982      3
1991      3
1977      3
1975      3
1973      2
1964      2
1974      2
1986      2
1976      2
1980      2
1972      1
1970      1
1969      1
1967      1
1965      1
1963      1
1962      1
1961      1
Name: release_year, dtype: int64
In [46]:
release.plot(kind='bar',figsize=(15,5))
plt.xlabel('years')
plt.ylabel('number of movies')
plt.title('Number of movies in each year')
Out[46]:
Text(0.5, 1.0, 'Number of movies in each year')
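
Because `value_counts()` sorts by count, the bars in the chart appear from the busiest year to the quietest rather than in time order. A small sketch of how the same counts could be put in chronological order, using a made-up `years` series in place of `df['release_year']`:

```python
import pandas as pd

# Made-up release years standing in for df['release_year'].
years = pd.Series([2011, 2010, 2011, 2009, 2010, 2011], name="release_year")

counts = years.value_counts()   # sorted by count, largest first
by_year = counts.sort_index()   # same counts, sorted by year
busiest = counts.idxmax()       # year with the most releases

print(by_year)
print(busiest)
```

In the notebook, calling `.plot(kind='bar')` on the year-sorted series would show the trend over time directly.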

Research Question 3 (Which movies are the most popular of all time?)

The most popular movie is the one with the highest popularity score.

In [29]:
df['popularity'].max()
Out[29]:
32.985763
In [30]:
df[df['popularity'] == df['popularity'].max()]
Out[30]:
id imdb_id popularity budget revenue original_title cast homepage director tagline ... overview runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj
0 135397 tt0369610 32.985763 150000000 1513528810 Jurassic World Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... http://www.jurassicworld.com/ Colin Trevorrow The park is open. ... Twenty-two years after the events of Jurassic ... 124 Action|Adventure|Science Fiction|Thriller Universal Studios|Amblin Entertainment|Legenda... 06/09/2015 5562 6.5 2015 137999939.3 1.392446e+09

1 rows × 21 columns

Conclusions

Tip: Jurassic World is the most popular movie
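
Since the question asks about movies in the plural, a ranked list may be more informative than the single maximum. A minimal sketch using `DataFrame.nlargest` on made-up data (the titles and scores below are placeholders, not values from the dataset):

```python
import pandas as pd

# Placeholder titles and scores for illustration only.
movies = pd.DataFrame({
    "original_title": ["A", "B", "C", "D"],
    "popularity": [0.5, 3.2, 1.1, 2.4],
})

# The three rows with the highest popularity, best first.
top = movies.nlargest(3, "popularity")[["original_title", "popularity"]]

print(top)
```

Applied to the notebook's `df`, the same call with a larger `n` would give a top-10 style table.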

Research Question 4 (Which movie title had the highest budget?)

We will work with two variables, 'budget' and 'original_title', and select the row with the maximum budget.

In [31]:
df[df['budget']==df['budget'].max()]
Out[31]:
id imdb_id popularity budget revenue original_title cast homepage director tagline ... overview runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj
2244 46528 tt1032751 0.25054 425000000 11087569 The Warrior's Way Kate Bosworth|Jang Dong-gun|Geoffrey Rush|Dann... http://www.iamrogue.com/thewarriorsway Sngmoo Lee Assassin. Hero. Legend. ... An Asian assassin (Dong-gun Jang) is forced to... 100 Adventure|Fantasy|Action|Western|Thriller Boram Entertainment Inc. 12/02/2010 74 6.4 2010 425000000.0 11087569.0

1 rows × 21 columns

In [32]:
df['budget'].describe()
Out[32]:
count    1.992000e+03
mean     3.454924e+07
std      5.061878e+07
min      0.000000e+00
25%      0.000000e+00
50%      1.500000e+07
75%      4.800000e+07
max      4.250000e+08
Name: budget, dtype: float64
In [72]:
plt.pie(df['budget'],labels=df['original_title']);
plt.title('budget')
plt.legend(df['original_title'])
plt.show()
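
A pie chart with one slice and one legend entry per movie is hard to read when there are almost 2,000 rows. As a hedged alternative, the sketch below keeps only the largest budgets and prepares them for a horizontal bar chart. The `movies` frame is made-up sample data, and the choice of excluding zero budgets is an assumption.

```python
import pandas as pd

# Made-up sample; a budget of 0 is treated as "not reported" (assumption).
movies = pd.DataFrame({
    "original_title": ["A", "B", "C", "D"],
    "budget": [0, 50, 200, 120],
})

funded = movies[movies["budget"] > 0]

# Largest budgets, indexed by title and sorted so the biggest bar ends up on top.
top_budgets = (
    funded.nlargest(2, "budget")
    .set_index("original_title")["budget"]
    .sort_values()
)

print(top_budgets)
```

In the notebook, `top_budgets.plot(kind='barh')` would then draw one labeled bar per movie.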

Conclusions

Tip: The Warrior's Way has the highest budget

Research Question 5 (Does a bigger film production budget result in more popularity?)

We will look at how popularity varies with budget, using the 'popularity' and 'budget' columns.

In [43]:
sns.regplot(x=df['budget'],y=df['popularity']).set_title('relation between budget and popularity')
Out[43]:
Text(0.5, 1.0, 'relation between budget and popularity')

Conclusions

Tip: There is a positive relationship between budget and popularity, with a few exceptions.
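
The regression plot can be complemented with a number. The sketch below computes the Pearson correlation with `Series.corr`, once on all rows and once after excluding zero budgets, on the assumption that a zero means the budget was not reported. The data is made up to show how such zeros can mask a relationship; it does not reproduce the dataset's values.

```python
import pandas as pd

# Made-up data: the first row has a zero (unreported) budget.
movies = pd.DataFrame({
    "budget": [0, 10, 20, 30, 40],
    "popularity": [5.0, 1.0, 2.0, 3.0, 4.0],
})

# Correlation over every row, zeros included.
r_all = movies["budget"].corr(movies["popularity"])

# Correlation over rows with a reported budget only.
known = movies[movies["budget"] > 0]
r_known = known["budget"].corr(known["popularity"])

print(r_all, r_known)
```

A correlation describes association only; it does not by itself show that a bigger budget causes more popularity.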

Limitations

Tip:
1- The cast column has many NaN values.
2- The budget and revenue columns contain many zeros.
3- The tagline column has many NaN values.
4- The dataset contained a duplicated row.
5- Some columns have incorrect data types.
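
One common way to handle the zeros in budget and revenue, sketched here on the assumption that a zero means "unknown" rather than a true value of zero, is to convert them to NaN so that summary statistics skip them. The `movies` frame and the `money_cols` list are illustrative names, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Made-up sample with zeros standing in for unreported amounts.
movies = pd.DataFrame({
    "budget": [0, 100, 0, 300],
    "revenue": [0, 50, 400, 0],
})

money_cols = ["budget", "revenue"]  # assumption: columns where 0 means missing
for col in money_cols:
    movies[col] = movies[col].replace(0, np.nan)

# NaN values are ignored by mean(), so the average reflects reported budgets only.
print(movies.isnull().sum())
print(movies["budget"].mean())
```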

Conclusions

Tip: Drama is the most popular genre, measured by the number of movies.
The largest number of movies was released in 2011.
Jurassic World has the highest popularity.
The Warrior's Way has the highest budget.
There is a positive relationship between budget and popularity, with a few exceptions.